Abstract

While there has been a lot of work on finding frequent itemsets in transaction 
data streams, none of these solve the problem of finding similar pairs according 
to standard similarity measures. This paper is a first attempt at dealing with this, 
arguably more important, problem. 

We start out with a negative result that also explains the lack of theoretical
upper bounds on the space usage of data mining algorithms for finding frequent
itemsets: Any algorithm that (even only approximately and with a chance of error)
finds the most frequent $k$-itemset must use space
$\Omega(\min\{mb, n^k, (mb/\varphi)^k\})$ bits, where $mb$ is the number of items
in the stream so far, $n$ is the number of distinct items and $\varphi$ is a
support threshold.

To achieve any non-trivial space upper bound we must thus abandon a worst- 
case assumption on the data stream. We work under the model that the transactions 
come in random order, and show that surprisingly, not only is small-space simi- 
larity mining possible for the most common similarity measures, but the mining 
accuracy improves with the length of the stream for any fixed support threshold. 

Keywords: algorithms; streaming; sampling; data mining; association rules. 



1 Introduction 

Imagine that we have a set of m sets ("transactions"), each a subset of {1, ... , n}, and 
that we want to find interesting associations among items in these transactions. This 
problem is often framed in a "market basket" model where we are interested in finding 
those pairs of items that are frequently bought together. Whether a pattern is really 
interesting or not is a problem dependent question, and for this reason various similarity 
measures other than number of co-occurrences have been introduced. Some of the most 
common measures are Jaccard [7], cosine, and all_confidence [17, 19]. Besides these
measures we are also interested in association rules, which are intimately related to
the overlap coefficient similarity measure. See [12, Chapter 5] for background and
discussion of similarity measures.

We initiate the study of this problem in the streaming model where transactions 
arrive one by one, and we are allowed limited time per transaction and very small space. 
The latter constraint implies we cannot hope to store much information regarding pairs 
that are not similar and, moreover, we cannot store the input. In particular, classical 
frequent itemset algorithms such as Apriori [1] and FP-growth [13] that work in several
passes over the data cannot be used. The survey of Jiang and Gruenwald [14] gives a
good overview of the challenges in data stream association mining. 

Previous works on transaction data streams have focused on finding frequent item- 
sets, and can be classified in the following way p2) : 

Landmark model The frequent itemsets are searched for in the whole stream, so 
that itemsets that appeared in the far past have the same importance as recent ones; 

Damped model This model is also called Time-Fading. Recent transactions have a
higher weight than older ones, so itemsets seen recently are considered more
interesting than those seen further in the past;

Sliding window Only a part of the stream is considered at a given time in this model, 
the one falling in the sliding window. This implies storing information concerning the 
transactions falling within the window, since whenever a transaction gets out of the 
window span, it has to be removed from the counts of the itemsets. 

The last two models make the problem of achieving low space usage simpler, since 
most of the information in the stream has little or no effect on the mining result. The 
challenge is instead to handle the real-time requirements of data stream settings. 

All these approaches look for frequent items and do not try to compute any simi- 
larity, relying on the tacit assumption that whatever is frequent is automatically inter- 
esting. This assumption is not always true: 

Example Suppose we have item 1 appearing in 20% of transactions, item 2 appear- 
ing in 20% of transactions, and the pair {1,2} appears in 10% of transactions. Suppose 
moreover that the pair {3, 4} appears in only 5% of transactions and that these trans- 
actions are the only ones in which 3 and 4 appear. The set {1, 2} has a frequency that 
is two times the one of {3, 4}. But looking at the similarity function cosine, we can 
easily realize that s(1, 2) = 10/20 = 0.5 while s(3, 4) = 5/5 = 1. If we base the idea
of similarity only on frequencies, we are likely to miss the pair {3, 4} which holds a 
much higher similarity than the more frequent pair {1,2}. 

Notice also that {3, 4} holds a higher similarity for all the measures we are ad- 
dressing, so the example shows how frequencies alone do not suffice to infer similarity 
properties of pairs. ◦
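The example can be checked numerically. The sketch below is illustrative only: the function names are ours, and the formulas are the standard textbook definitions of the measures, written in terms of the supports of the two items and of the pair. They are assumed here rather than taken from Figure 1.

```python
import math

# Each measure takes the supports a, b of two items and the support ab of the pair.
def cosine(a, b, ab):
    return ab / math.sqrt(a * b)

def jaccard(a, b, ab):
    return ab / (a + b - ab)

def dice(a, b, ab):
    return 2 * ab / (a + b)

def overlap(a, b, ab):
    return ab / min(a, b)

def all_confidence(a, b, ab):
    return ab / max(a, b)

MEASURES = [cosine, jaccard, dice, overlap, all_confidence]

# Supports in percent of transactions, as in the example above.
pair_12 = (20, 20, 10)   # items 1 and 2, pair {1, 2}
pair_34 = (5, 5, 5)      # items 3 and 4, pair {3, 4}

for measure in MEASURES:
    print(measure.__name__, measure(*pair_12), measure(*pair_34))
```

Under these definitions every measure ranks {3, 4} above {1, 2}, even though {1, 2} is twice as frequent.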

Our contributions In this paper we address the problem of finding similar pairs in 
a stream of transactions. We first show a negative result, which is that a worst-case 
stream does not allow solutions with non-trivial space usage: To approximate even 
the simplest similarity measure one essentially needs space that would be sufficient to 
store either the number of occurrences of all pairs or the contents of the stream itself. 
Imposing a minimum support $\varphi$ for the items we are interested in alleviates the
problem only when $\varphi$ is close to the number of transactions.

Theorem 1 Given a constant $k > 0$, and integers $m$, $n$, $\varphi$, consider inputs of $m$
transactions of total size $mk$ with $n$ distinct items. Let $s_{\max}$ denote the highest support
among $k$-itemsets where each item has support $\varphi$ or more. Any algorithm that makes a
single pass over the transactions and estimates $s_{\max}$ within a factor $\alpha < 2$ with error
probability $\delta < 1/2$ must use space $\Omega(\min(m, n^k, (m/\varphi)^k))$ bits in expectation on a
worst-case input distribution. ◦



This lower bound extends and strengthens a lower bound for single-item streams
presented in [8].

Of course, many data streams may not exhibit worst-case behavior. Several papers 
have considered models of data streams where the items are supposed to be independently
chosen from some distribution, or presented in random order [5, 9, 11, 21]. We
present an upper bound that works for a worst-case set of transactions under the condi- 
tion that it is presented in random order, which is sufficient to bypass the lower bound. 
Our method is general in the sense that it can evaluate the similarity of pairs according 
to several well-established measure functions. 

We note that outside the streaming domain, distributed sorting algorithms, such as 
the one built into MapReduce, can be used to permute transactions in random order (by 
using random values as keys). It seems likely that our approach can also be used in a 
1-pass MapReduce implementation. 

Theorem 2 Let $\delta > 0$ be constant, and $s, M > 1$ be integers. We consider a data
stream of transactions (subsets of $\{1, \dots, n\}$) of maximum size $M$, where in each prefix
the set of transactions appears in random order. For all the similarity measures in
Figure 1 there is a streaming algorithm (depending on $s$ and $M$) that maintains a
"$1 \pm \delta$ approximation" of the $s$ most similar high-support pairs in the stream, as follows:
Within the $m$ transactions seen so far, let $\Delta$ be the $s$th highest similarity among pairs
$\{i, j\}$ where both $i$ and $j$ appear at least $\varphi$ times, where $\varphi$ can be any function of $m$.

There exists $L = O(\log(mn))$ such that if
$\Delta > \frac{L}{\varphi} \max\{\sqrt{mbM/s},\, M\}$, then the pairs
maintained all have similarity at least $(1 - \delta)\Delta$ with high probability, and all such
pairs with similarity $(1 + \delta)\Delta$ or more are reported. To process a prefix of $mb$ items,
the algorithm uses time $O(mb \log(nm))$, with high probability, and space $O(n + s)$. ◦

It is worth noticing that $s$ can be chosen as $O(n)$, which yields a space usage linear
in the number of distinct items. Conversely, choosing $s$ smaller does not improve the
space usage, so we may assume $s \geq n$. In absence of a known bound on the maximum
transaction size, one can use $M = n$. Then the algorithm guarantees to detect pairs
with similarity at least $\frac{L}{\varphi} \max\{\sqrt{mb},\, n\}$. Using $s \geq n$ and
ignoring the logarithmic factor $L$ this means that up to input size $mb = n^2$ we can
detect similarity $n/\varphi$, and after this point we can detect similarity
$\sqrt{mb}/\varphi$. Assuming that $\varphi$ is chosen as a linear
function of $m$ (relative support threshold), we see that the accuracy improves with the
length of the stream.
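To see the trend numerically, the following sketch evaluates the detectable similarity with the logarithmic factor $L$ dropped, taking $M = n$ and $s \geq n$, that is, the quantity $\max\{\sqrt{mb}, n\}/\varphi$. The function name and the sample parameter values (including the 1% relative support) are illustrative assumptions, not taken from the paper.

```python
import math

def detectable_similarity(m, b, n, phi):
    """max(sqrt(m*b), n) / phi: the bound with M = n, s >= n and L ignored."""
    return max(math.sqrt(m * b), n) / phi

n, b = 1000, 10                      # assumed number of items and average length
for m in (10**5, 10**6, 10**7, 10**8):
    phi = 0.01 * m                   # assumed relative support threshold of 1%
    print(m, detectable_similarity(m, b, n, phi))
```

With a relative support threshold the printed values shrink as $m$ grows, which is the sense in which accuracy improves with the length of the stream.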

1.1 Previous work 

Denote by m the number of transactions seen up to the moment in which we want to 
report the similar pairs. Let n indicate the number of distinct items that can appear 
in transactions. Without loss of generality we can assume these items are in the set 
{1, . . . , n}. Parameter b is the average length of transactions (such that mb is the size 
of the data set seen so far). 
Most of the algorithms we describe actually consider the problem of finding frequent
objects in a stream of items, so they do not focus on itemsets, as we do. But
given a stream of transactions we can of course generate the stream of all pairs
occurring in these transactions, and feed them to a frequent item algorithm. (We do not
consider here that this might not be possible for large transactions in settings where
real-time constraints are important.) In the following we let $M_2$ denote the length of
the derived stream of pairs.



Landmark model 

Many papers have addressed the problem of frequent items in a stream. Starting from
the seminal paper [2], many streaming algorithms have appeared in recent years, and
important contributions to the problem of frequent items (and indirectly frequent
itemsets) have thus been presented.

In several independent papers [9, 15, 18] algorithms have been presented that can
find all pairs with support at least $k$ using space $M_2/(k - 1)$ and constant time per pair
in the stream. These algorithms may generate false positives, i.e., it is only known that
the output will contain the frequent pairs.

Cormode and Muthukrishnan [8] consider the problem of reporting hot items in a
fully dynamic database scenario. The space usage is similar to the schemes above, but 
the error probability can be reduced arbitrarily (at the cost of space). 

Also in [8] is a lower bound on the number of bits of memory necessary in order to
answer queries that concern reporting the items with frequencies over a certain
threshold. This lower bound is extended and generalized by our lower bound in Theorem 1.

In [6] the Count Sketch algorithm tackles the problem of reporting the k most
frequent itemsets. For worst-case distributions their algorithm has similar performance 
to those mentioned above, but for skewed distributions they are able to detect itemsets 
with smaller frequencies in the same amount of space. 



A false negative approach 

Yu et al. [21] present algorithms directly addressing the problem of finding frequent
itemsets in a transaction stream. The algorithm does not find itemsets that are similar
by means of measure functions other than support. Under the assumption that items
occur independently (which is arguably quite strong, since we are assuming that there
may be dependencies resulting in frequent sets) the authors show upper bounds on
space usage similar to those of [8]. The performance is tested on artificial data sets
where the independence assumption holds. For itemsets of size two (or more) the
paper lacks a theoretical analysis of the proposed algorithm, but claims an empirical
space usage bounded by $m^3/k^3$.



Sampling according to the similarity 

Our algorithm builds on top of an idea presented in [3, 4]. The sampling technique used
in that algorithm is such that pairs are sampled a number of times that is proportional
to their similarity. (A more technical explanation can be found in Section 3.1 where we
improve the sampling procedure to make it suitable for a streaming environment.) The
algorithms presented in [3, 4] have near-optimal running time when no information on
the distribution of similarities is given. As a matter of fact, the running time is linear
in the size of the input and output (when there are many pairs of roughly the same
similarity). The methods presented are highly general and apply to many measure
functions that are linear in the number of occurrences of a pair. However, the method
does not directly apply to a streaming setting since it needs two passes over the data.

2 Lower bound 

There are two naive approaches to handling $k$-itemset support counting in a data stream
setting: One consists in storing all the transactions seen (possibly trying to compress
the representation), and the other one maintains support counts for all $k$-itemsets seen
so far.

Theorem 1 says that it is not possible to beat the best of these approaches in the
worst case (with support threshold $\varphi = 1$). The proof is a reduction from
communication complexity:

Proof. The inputs considered for the lower bound have $m$ transactions of size $k$. Let
$n' = \min(n, \lfloor mk/(2\varphi) \rfloor) - 1$ be the largest possible number of items that
can appear $\varphi$ times in $m/2$ transactions, minus 1. We pick an arbitrary set $F$ of $n'$
items, and will form an input stream that consists of two parts:

• In the first $m/2$ transactions we ensure that each item in $F$ appears $\varphi$ times or
more, while no $k$-subset of $F$ appears. This can be done by putting one item not
in $F$ in each transaction.

• In the last $m/2$ transactions we encode information that will require many bits
to store, as detailed below.

Consider the first $s = \min(m/2, \binom{n'}{k})$ transactions in the second part. Since
$s \leq \binom{n'}{k}$ we can map the numbers $\{1, \dots, s\}$ to unique $k$-itemsets in $F$.
In particular, any bit string $x \in \{0, 1\}^s$ can be mapped to the unique set of
transactions corresponding to the positions of 1s in $x$. In this data set, each
$k$-itemset from $F$ appears at most once.

Suppose we have an algorithm that can determine the support of the most frequent
itemset within a factor $\alpha < 2$ with probability $1 - \delta$. This implies that, on inputs
where no itemset appears more than twice, the algorithm can distinguish (with
probability $1 - \delta$) the cases where the most frequent itemset appears once and twice.
Given $x \in \{0, 1\}^s$ we consider the memory configuration after the algorithm has seen
the set of transactions that correspond to $x$. This can be seen as a "message" that
encodes sufficient information on $x$ that allows us to determine if one of the itemsets we
have seen appears later in the stream. Lower bounds from communication complexity
(see [16, Example 3.22]) tell us that even when we allow error probability $\delta < 1/2$ the
amount of communication to determine whether $x, y \in \{0, 1\}^s$ have a 1 in the same
position (corresponding to the same $k$-itemset appearing twice) is $\Omega(s)$ bits in
expectation. This means that the memory representation (even if it is compressed) must
use $\Omega(s)$ bits. Using the estimate
$\binom{n'}{k} \geq \min\left(\binom{n-1}{k}, \binom{mk/(3\varphi)}{k}\right) = \Omega(\min(n^k, (m/\varphi)^k))$
we get the lower bound stated in the theorem. □
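The encoding used in the proof can be made concrete with a small sketch. It maps each 1-position of a bit string to a distinct $k$-itemset over $F$ and checks that two bit strings share a 1-position exactly when the concatenated stream contains an itemset twice. The helper names `encode` and `max_support`, the choice of lexicographic order, and the toy set `F` are illustrative assumptions.

```python
from itertools import combinations, islice
from math import comb

def encode(bits, items, k):
    """Map each position p with bits[p] == 1 to the p-th k-itemset over
    `items` (lexicographic order); each itemset becomes one transaction."""
    assert len(bits) <= comb(len(items), k), "need s <= C(n', k)"
    itemsets = islice(combinations(items, k), len(bits))
    return [frozenset(t) for bit, t in zip(bits, itemsets) if bit]

def max_support(transactions):
    """Highest number of occurrences of any transaction (each is a k-itemset)."""
    counts = {}
    for t in transactions:
        counts[t] = counts.get(t, 0) + 1
    return max(counts.values(), default=0)

F = range(4)                 # hypothetical item set F with n' = 4
x = [1, 0, 1, 0]
y_disjoint = [0, 1, 0, 0]    # no common 1-position with x
y_overlap = [0, 0, 1, 1]     # shares position 2 with x

print(max_support(encode(x, F, 2) + encode(y_disjoint, F, 2)))
print(max_support(encode(x, F, 2) + encode(y_overlap, F, 2)))
```

An algorithm that separates maximum support 1 from maximum support 2 therefore decides whether $x$ and $y$ intersect, which is the communication problem the proof reduces from.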



Corollary 3 Any deterministic algorithm that determines the highest support in a
transaction data stream must, after having processed transactions of total size $mb$, use
space $\Omega(\min(mb, n^k))$ bits on a worst-case input. ◦
Figure 1: Measures that we cover with our algorithm and the corresponding functions.
The overlap coefficient measure has the property that finding pairs having similarity
over a certain threshold implies finding all association rules with confidence over
the same threshold. As argued in [3, 4], Jaccard similarity can be handled via Dice
similarity.







Figure 2: Overview of the algorithm with all its components. 

3 Our algorithm 

We present a new algorithm for extracting similar pairs from a set of transactions using 
only one pass over the data. The algorithm is approximate, so false negatives and 
false positives occur. Most of our discussion will concern space usage, but we are also 
aiming for very low per-item time complexity of the algorithm. In particular, we will 
not allow anything like iterating through all pairs in a transaction. 

The measures we will address are reported in Figure 1 and are all symmetric. This
means that we are interested only in looking at pairs (i, j) where i < j. For this reason
we will use set notation for the pairs, so instead of (i, j) we will write {i, j}.

Parameters of the algorithm We recall that $\varphi$ is the item support threshold, and $M$
is the maximal transaction size. Increasing $\varphi$ will decrease the minimum similarity the
algorithm will be able to spot. $M$ is a characteristic of the transactions, supplied as a
parameter to the algorithm. In absence of a known bound on $M$, one can set $M = n$.
The parameter $s$ determines the space usage of the algorithm, which is $O(n + s)$ words.
Notation In the streaming framework, the total number of transactions is not known.
In order to address this issue, we consider sets of transactions, prefixes, of the stream
of increasing size. Suppose that so far we have seen $m$ transactions
$T_1, \dots, T_m \subseteq \{1, \dots, n\}$.

The current prefix has length $2^t$, $t \in \mathbb{N} \cup \{0\}$, when $m$ falls in the interval
$[2^t, 2^{t+1})$. Our algorithm maintains counts of all items and stores copies of the counts
every time the current prefix changes (that is: every time the number of transactions
seen is two times the length of the current prefix). Each time the current prefix changes,
we update our estimate of the most similar pairs, and use this estimate until the next
change of current prefix.
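The prefix-doubling bookkeeping can be sketched as follows. This is a minimal illustration of the counting and snapshot step only; the class name `PrefixCounts` and its fields are hypothetical and the update of the similarity estimates is left out.

```python
from collections import Counter

class PrefixCounts:
    """Running item counts plus a copy stored whenever the current prefix changes."""

    def __init__(self):
        self.m = 0                  # transactions seen so far
        self.counts = Counter()     # running item counts
        self.snapshot = Counter()   # counts stored at the last prefix change
        self.prefix_len = 0         # length 2^t of the current prefix

    def add(self, transaction):
        """Process one transaction; return True if the current prefix changed."""
        self.m += 1
        self.counts.update(set(transaction))
        if self.m & (self.m - 1) == 0:       # m is a power of two
            self.prefix_len = self.m
            self.snapshot = self.counts.copy()
            return True
        return False

pc = PrefixCounts()
changes = [pc.add(t) for t in [{1, 2}, {2, 3}, {1}, {2}, {3}]]
print(changes, pc.prefix_len, pc.snapshot[2], pc.counts[3])
```

Between two prefix changes the sampler would read `snapshot`, not `counts`, so that sampling probabilities stay fixed until the next doubling.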

The algorithm is based on two pipelined stages: a stream of pairs generation phase
and a store and count phase. We will describe the two phases separately, since the
output of the former phase will constitute the input of the latter. Figure 2 gives an
overview of the algorithm.

The prefixes of the stream are fed to a pair sampling stage that uses the stored
counts from the previous prefix to compute sampling probabilities. Given the current
prefix, the counts relative to that prefix will be used in order to sample pairs in the
stream, until a new set of counts is stored for the prefix of length $2^{t+1}$. The idea is
that, since transactions come in random order, the sampling probabilities should be
approximately the same as for the BiSam sampling procedure (which bases the sampling
probabilities on exact item frequencies).

In Section 4 we show how this technique samples, with high probability, the pairs
having a high enough similarity. In fact, we show that a stronger property holds with
high probability: Even when we split the stream into $\kappa$ chunks, each with the same
number of transactions, we will sample these pairs sufficiently often in each chunk to
reliably estimate their similarity.

3.1 Pair sampling 

We base our technique on the sampling method of the BiSam algorithm [3, 4]. For
each transaction the pairs are sampled according to their support, such that the pair
{i, j} is sampled with probability $\tau f(|S_i|, |S_j|)$, where $f$ is a function that depends
on the similarity measure considered, and $\tau$ is a parameter that is used to control the
sampling rate. The value of $\tau$ and the number of chunks $\kappa$ are fixed in Section 4.

BiSam idea The idea is that after both $i$ and $j$ have appeared $\varphi$ times, the expected
number of times {i, j} is sampled is proportional to $s(i, j)$. Also, the number of
samples follows a highly concentrated (binomial) distribution, so the true similarity can
be estimated reliably for pairs that are sampled sufficiently often. For any $f$ that is
non-increasing in both parameters, the BiSam algorithm performs the sampling in time
that is expected linear in the transaction size plus the number of samples. However, the
time to process a transaction may be quadratic with non-negligible probability, which
is problematic for application in a streaming context. We refer to [3, 4] for details.

Streaming adaptation Two things allow us to arrive at a version suitable for
streaming:

• While BiSam produces dependent samples, in the sense that the numbers of times
two different itemsets are sampled are not independent, we show how to make the
samples produced independent. This will ensure that the number of samples
from each transaction is highly concentrated around its expectation.

• The requirement of minimum support $\varphi$ will ensure that processing of a single
transaction takes "linear time with high probability." More precisely: Any set of
consecutive transactions with a total of $\log m$ items will require linear time with
high probability.

To achieve independence we will change the sampling probabilities by rounding
them down to the nearest negative power of 2. This means that the expected number of
times {i, j} is sampled is no longer exactly proportional to $s(i, j)$, but is changed by
a factor $\gamma_{ij} \in [1, 2]$. However, since the sampling probability is known, which means
that $\gamma_{ij}$ will be constant for any given {i, j}, we can still use the sample counts to
reliably estimate similarity.
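The rounding step and the resulting correction factor can be sketched as below. The function name is hypothetical, and we assume that "negative power of 2" means $2^{-k}$ with $k \geq 1$, so that a probability of 1 is rounded to 1/2 and the factor can reach 2.

```python
import math

def round_down_pow2(p):
    """Round a probability p in (0, 1] down to 2**-k with k >= 1.
    Returns (k, rounded, gamma) where gamma = p / rounded lies in [1, 2]."""
    assert 0.0 < p <= 1.0
    _, exponent = math.frexp(p)      # p = mantissa * 2**exponent, mantissa in [0.5, 1)
    k = max(1, 1 - exponent)         # 2**(exponent - 1) is the largest power of 2 <= p
    rounded = 2.0 ** -k
    return k, rounded, p / rounded

for p in (1.0, 0.5, 0.3, 0.25, 0.001):
    print(p, round_down_pow2(p))
```

Multiplying each observed sample of a pair by its factor restores an expected weight proportional to the unrounded probability, which is how the counting phase uses it.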

Details For a transaction $T_t$ we can visualize the pairs in $T_t \times T_t$ as a 2-dimensional
table, with rows and columns sorted by support, where we are interested in the pairs
below the diagonal (index $i < j$). Since $f$ is non-increasing the sampling probabilities
are decreasing in each row and column. This means that for any $k > 0$, in time
$O(|T_t|)$ we can determine what interval in each row of the table is to be sampled with
probability $2^{-k}$. To produce the part of the sample for one such interval, we describe a
method for producing a random sample of $S = \{1, \dots, \phi\}$, for a given integer $\phi$, where
each number is sampled with the same probability $p$. Since $p\phi$ may be much smaller
than $\phi$, we want the time to depend on the number of samples, rather than on $\phi$. This
can be achieved using a simple recursive procedure similar to the one used in efficient
implementations of reservoir sampling: With probability $(1 - p)^{\phi}$ we return an empty
sample. Otherwise, we choose one random element $x$ from $S$, and recursively take a
sample of the set $S \setminus \{x\}$ with sampling probability $p$. The set $S$ can be maintained in an
array, where sampled numbers are marked. In case more than half of the numbers are
marked, we construct a new array containing only unmarked numbers (the amortized
cost of this is constant per marking). To select a random unmarked number we sample
until one is found, which takes expected $O(1)$ time because no more than half of the
numbers are marked.
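For illustration, here is one way to draw such a sample in time proportional to the number of samples. Note that this sketch uses geometric skipping, a standard alternative technique, and not the recursive procedure described above; the function name is hypothetical.

```python
import math
import random

def bernoulli_sample(phi, p, rng=random):
    """Return the elements of {1, ..., phi} kept when each one is chosen
    independently with probability p, in increasing order.
    Runs in time proportional to the number of elements returned."""
    if p <= 0.0:
        return []
    if p >= 1.0:
        return list(range(1, phi + 1))
    log_q = math.log1p(-p)               # log(1 - p), negative
    sample, pos = [], 0
    while True:
        u = 1.0 - rng.random()           # uniform in (0, 1]
        # Gap to the next chosen element is geometric with parameter p.
        pos += int(math.log(u) / log_q) + 1
        if pos > phi:
            return sample
        sample.append(pos)

rng = random.Random(2024)                # fixed seed, for repeatability only
print(bernoulli_sample(1000, 0.01, rng))
```

Applied to one interval of a row of the table, `phi` would be the interval length and `p` the rounded probability $2^{-k}$ shared by all pairs in that interval.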

In summary, for each sampling probability $2^{-k}$ we can compute the corresponding
part of the sample in expected time $O(|T_t| + z_k)$, where $z_k$ is the number of samples.
This is done for $k = 1, 2, \dots, 2\log(nm)$. Sampling probabilities smaller than $(nm)^{-2}$
are ignored, since the probability that any such pair would be sampled in any
transaction is less than $1/m$. That is, with high probability ignoring such pairs does not
influence the sample. To state our result, let $2^{-\mathbb{N}}$ denote the set of negative integer
powers of 2.

Lemma 4 Let $f : \mathbb{N} \times \mathbb{N} \to 2^{-\mathbb{N}}$ be non-increasing in both parameters. Given a
transaction $T_t$ and support counts $|S_i|$ for its items, in expected time
$O(|T_t| \log(nm) + z)$ we can produce a random sample of $z$ 2-subsets of $T_t$ such that:

• {i, j} is sampled with probability $f(|S_i|, |S_j|)$ if $f(|S_i|, |S_j|) > (nm)^{-2}$, and
otherwise with probability 0, and

• the samples are independent. ◦

For all similarity measures in Figure 1 and any feasible value of $\tau$, the minimum support
requirement will ensure that the expected number of samples in a transaction is at most
$|T_t|$. This means that for each transaction $T_t$, the time spent is $O(|T_t| \log(nm))$ with
high probability.

3.2 SampleCount 

This phase sees the stream of pairs generated by the pair sampling, and has to filter out 
as many low similarity pairs as possible, while successfully identifying high similarity 
pairs. By the properties of pair sampling, this is essentially the task of identifying 
frequent pairs in the stream of samples. We aim for space usage that is smaller than 
that of standard algorithms for frequent item mining in a data stream. In order to 
accomplish this we use a modification of an algorithm by Demaine et al. [9]. Their
algorithm finds frequent items in a randomly permuted stream of items, and so does 
not directly apply to our setting where only the transactions are assumed to come in 
random order. Demaine et al. are able to sample random elements by simply taking 
the first elements from the stream. This would not work in our setting, where all these 
elements might be pairs coming from the same transaction. 

Reservoir sampling Instead, we use a reservoir sampling method [20]. We sketch
the mechanism here and we refer to the original paper for a complete description. 
Suppose we have a sequence of d items and we want to sample a random subset of 
the sequence. We first of all put in the sample the first s elements that we see. For each 
subsequent element, in position t > s, we will put it in the sample with probability s/t. 
When a new element has to be included in the sample, another one that is already part 
of the sample has to be evicted. Each element of the set of samples will be chosen as 
the victim with probability 1/s. This technique ensures we will end up with a set of 
samples that is a true random sample of size s. 
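The mechanism just described can be written in a few lines. This is a minimal sketch of the classical scheme under the description above; the function name and the `rng` parameter are our own.

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Keep a uniform random sample of size s from a stream of unknown length."""
    sample = []
    for t, item in enumerate(stream, start=1):
        if t <= s:
            sample.append(item)              # the first s elements always enter
        elif rng.random() < s / t:           # element t enters with probability s/t
            sample[rng.randrange(s)] = item  # evict a victim chosen with probability 1/s
    return sample

print(reservoir_sample(range(1000), 5, random.Random(1)))
```

If the stream holds fewer than `s` elements, the sketch simply returns all of them.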

SampleCount We consider the stream of pairs divided into $\kappa$ chunks. The pair
sampling generates these chunks such that each chunk corresponds to some set of
transactions (i.e., all the pairs sampled from each transaction end up in the same chunk).

We run reservoir sampling on every other chunk to produce a truly random sample
of size $s/2$. We then proceed to count the occurrences of the elements of the sample
in the next chunk. Assume in the following that we number chunks by $[\kappa]$, such that
reservoir sampling is done on even-numbered chunks, indexed by $[\kappa_{\text{even}}]$.

When doing the above, whenever we see a pair {i, j} whose count must be updated,
we weigh the sample by the factor $\gamma_{ij}$ that got "lost" during the pair sampling phase, so
as to consider an expected number of samples exactly proportional to $s(i, j)$. At the end
of a counting chunk we estimate the similarities of all pairs sampled, and keep the $s/2$
largest similarities seen so far. At the end of the stream the similarity estimates found
are returned to supersede the previous estimates. Pseudocode for the SampleCount
algorithm is shown in Algorithm 1.
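A simplified, runnable sketch of this phase is given below. It is not the authors' implementation: the input format (a list of chunks, each a list of `(pair, gamma)` tuples), the function name, and the choice to rank pairs by their weighted counts instead of by converted similarity estimates are assumptions made for illustration.

```python
import heapq
import random

def sample_count(chunks, s, rng=random):
    """Alternate reservoir sampling (even chunks) and weighted counting (odd chunks).
    Returns the s/2 pairs with the largest weighted counts seen so far."""
    half = s // 2
    best = {}
    for c in range(0, len(chunks) - 1, 2):
        # Reservoir sampling of size s/2 over the even-numbered chunk.
        reservoir = []
        for t, (pair, _) in enumerate(chunks[c], start=1):
            if t <= half:
                reservoir.append(pair)
            elif rng.random() < half / t:
                reservoir[rng.randrange(half)] = pair
        # Count occurrences of sampled pairs in the next chunk, weighted by gamma.
        counts = dict.fromkeys(reservoir, 0.0)
        for pair, gamma in chunks[c + 1]:
            if pair in counts:
                counts[pair] += gamma
        # Merge with earlier results and keep only the s/2 largest.
        for pair, weight in counts.items():
            best[pair] = max(best.get(pair, 0.0), weight)
        best = dict(heapq.nlargest(half, best.items(), key=lambda kv: kv[1]))
    return best

chunk0 = [(("a", "b"), 1.0)] * 10
chunk1 = [(("a", "b"), 1.5)] * 4 + [(("c", "d"), 1.0)] * 3
print(sample_count([chunk0, chunk1], s=4))
```

In the toy run the pair ("c", "d") never enters the reservoir, so it is ignored in the counting chunk even though it occurs there.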

4 Analysis 

Let $S_i$ denote the set of transactions containing the element $i$. This means that
$S_i \cap S_j$ is the set of transactions containing the pair {i, j}. Let $S_i^1$ denote the set of
transactions containing $i$ in the current prefix of the stream. Similarly, $S_i^{C_k}$ will denote
the set of transactions containing $i$ in $C_k$, the chunk $k$ of the suffix of the stream up to
the point in which a new current prefix changes the counts of items occurrences. So
$S_i^{C_k} = S_i \cap C_k$.
Algorithm 1 Pseudocode for the SampleCount phase.

procedure SampleCount(P, s, size)
    ▷ P is a stream of pairs, each of which has associated a similarity value.
    ▷ The length of P is known.
    S'_out ← ∅
    while there are elements in P do
        S' ← ∅
        S ← ∅
        t ← 0
        S ← the first s/2 elements in P
        while (t < size/2 − s/2) do
            i ← the next element in P
            Choose uniformly at random a number r ∈ [0, 1]
            if r < s/(s + 2t + 2) then
                Choose uniformly at random a victim from S and substitute it with i
            end if
            t ← t + 1
        end while
        initialize(S', S)
        ▷ S' is an associative array indexed on the distinct items present in S;
        ▷ initializing it means putting all its entries to 0
        while (t < size) do
            i ← the next element in P
            if i ∈ S then
                S'(i) ← S'(i) + γ_i
            end if
            t ← t + 1
        end while
        Choose the s topmost distinct items between S'_out and S', and assign them to S'_out
    end while
    Return S'_out
end procedure



Definition 5 Given $x, y \in \mathbb{R}$ we say that $x$ $(\delta, L)$-approximates $y$, written
$x \overset{\delta,L}{\approx} y$, if and only if $x > L$ implies $x \in [(1 - \delta)y; (1 + \delta)y]$. ◦
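As a quick illustration, the definition translates directly into a predicate. The function name is hypothetical; note that the condition is vacuously true whenever $x \leq L$.

```python
def approximates(x, y, delta, L):
    """True if x (delta, L)-approximates y: x > L implies (1-delta)*y <= x <= (1+delta)*y."""
    return x <= L or (1 - delta) * y <= x <= (1 + delta) * y

print(approximates(5, 100, 0.1, 10))    # x <= L, so nothing is required
print(approximates(95, 100, 0.1, 10))   # x > L and within 10% of y
print(approximates(50, 100, 0.1, 10))   # x > L but far from y
```

The threshold $L$ reflects that small counts are too noisy to carry a multiplicative guarantee.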

The notation extends in the natural way to approximate inequalities. 

In what follows we will use $(\delta, L)$-approximations, where $L = C \log(mn)$ for a
suitably large constant $C$ (depending on the accuracy $\delta$ in Theorem 2). The task is
to analyze the accuracy of the new approximation computed when the current prefix
changes. We introduce two random events, GoodPermutation (GP) and
GoodBisamSample (GBS), and bound the probability that they do not happen.

A permutation of the transactions is called good for {i, j}, denoted $GP_{ij}$, if and
only if the following conditions hold (for the current prefix):

1. $|S_i^1| \overset{\delta,L}{\approx} |S_i|/2$ and $|S_j^1| \overset{\delta,L}{\approx} |S_j|/2$;

2. $\forall k:\ |S_i^{C_k} \cap S_j^{C_k}| \overset{\delta,L}{\approx} |S_i \cap S_j|/2\kappa$;

Essentially, goodness means that the frequencies of individual items are close in the
first and second half of the current prefix and the frequency of the pair is evenly spread
over the chunks in the second part of the current prefix.

Lemma 6 Given S E [0; 1] C R, we have: 

-|S;|<5 2 

Pr[GP M ] > 1 - 6 • e—^~ 

Proof. An interesting property of the random variables $|S_i^1|$ and $|S_i^{C_k} \cap S_j^{C_k}|$ is that
they are negatively dependent [10]. In a nutshell, the random variables in the vector
$X = (X_1, \dots, X_n)$ are negatively dependent if and only if for every two disjoint sets
$I, J \subseteq \{1, \dots, n\}$ and for every pair of both non-increasing or both non-decreasing
functions $f : \mathbb{R}^{|I|} \to \mathbb{R}$, $g : \mathbb{R}^{|J|} \to \mathbb{R}$, it holds that
$E[f(X_i, i \in I)\, g(X_j, j \in J)] \leq E[f(X_i, i \in I)]\, E[g(X_j, j \in J)]$. We will use this
property later on in the proof. First of all we bound the probability that $|S_i^1|$ is far
from $|S_i|/2$. Using Chernoff bounds we can write:

Pr^l - |5j|/2| < S\Si\/2] < 2 • e"^ (1) 
Looking at \Sjf n Sj \ we can write: 

Pr[|Sf n fif*| - \Si n Sj\/2k\ < S\Si n S,-|/2«] < 2 • e~ ' S ' n ^'" (2) 

We use the fact that Chernoff bounds also hold for negatively dependent random
variables. Since the last bound is the weakest of the three, the lemma follows. □

We want $GP_{ij}$ to hold with probability $1 - o(1/n^2)$ whenever items $i$ and $j$ both
have support $\varphi$. From Lemma 6 we get that this holds if $|S_i \cap S_j| > C\kappa \log n$, for some
constant $C$ (depending on $\delta$). If $s(i,j) > 2\kappa L f(\varphi, \varphi) \geq \kappa L/\varphi$ then
$|S_i \cap S_j| > 2\kappa L$. Hence, a sufficient condition for the similarity is

$s(i,j) > \kappa L/\varphi$. (3)

It remains to understand what is the probability that, given a good permutation, the
pair sampler will take a number of samples for a given pair in each chunk $k$ that leads
to a $(1 \pm \delta)$-approximation of $s(i, j)$. We denote the latter event by $GBS_{ij,k}$, and want
to bound the quantity $\Pr[GBS_{ij,k} \mid GP_{ij}]$.

For this purpose consider the random variable $X_{ij,k}$ defined as the number of times
we sample the pair {i, j} in chunk $k$. Assuming $GP_{ij}$ we have that (over the
randomness in the pair sampling algorithm)
$E[X_{ij,k}] \overset{\delta,L}{\approx} f(|S_i^1|, |S_j^1|)\,\tau\,|S_i \cap S_j|/2\kappa$. Since
the occurrences of {i, j} are independently sampled, we can apply a Chernoff bound
to conclude $X_{ij,k} \overset{\delta,L}{\approx} E[X_{ij,k}]$. This leads to the conclusion:

Lemma 7 $X_{ij,k} \overset{\delta,L}{\approx} f(|S_i^1|, |S_j^1|)\,\tau\,|S_i \cap S_j|/2\kappa$

Suppose that $X_{ij,k}$ is close to its expectation. Then we can use it, with
$(1 \pm \delta)$-approximations of $|S_i|$ and $|S_j|$, to compute a
$(1 \pm O(\delta))$-approximation of $s(i,j)$. This follows by analysis of the
concrete functions $f$ of the measures in Figure 1.
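
As a rough illustration of how a sample count can be turned into a similarity
estimate, the sketch below simulates a single chunk. It is an idealized setting
and rests on assumptions not stated in this section: that
s(i,j) = |S_i ∩ S_j| · f(|S_i|, |S_j|), that the cosine measure uses
f(a, b) = 1/sqrt(ab), that the weight is evaluated at the true supports, and that
each co-occurrence in a chunk is kept independently with probability τ · f. All
function names and parameter values are hypothetical.

```python
import math
import random


def cosine_weight(size_i, size_j):
    """Assumed weight f(|S_i|, |S_j|) = 1 / sqrt(|S_i| * |S_j|) for cosine."""
    return 1.0 / math.sqrt(size_i * size_j)


def sample_chunk(cooccurrences, keep_probability, rng):
    """Number of co-occurrences kept when each is sampled independently."""
    return sum(rng.random() < keep_probability for _ in range(cooccurrences))


def estimate_similarity(samples, tau, kappa):
    """Invert E[X] ~ s(i,j) * tau / (2 * kappa) to recover s(i,j)."""
    return 2.0 * kappa * samples / tau


# Hypothetical parameters, chosen only so that the expected count is large.
rng = random.Random(0)
size_i = size_j = 200_000
intersection = 100_000
tau, kappa = 2_000.0, 5

weight = cosine_weight(size_i, size_j)
true_similarity = intersection * weight
per_chunk = intersection // (2 * kappa)  # co-occurrences falling in one chunk
samples = sample_chunk(per_chunk, tau * weight, rng)
estimate = estimate_similarity(samples, tau, kappa)
print(true_similarity, estimate)
```

With these values the expected sample count in the chunk is about 100, so the
estimate typically lands within roughly ten percent of the true similarity;
smaller expected counts give correspondingly noisier estimates.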

A sufficient condition on the similarity needed for a $(1 \pm \delta)$-approximation
of $X_{ij,k}$ can be inferred from Lemma 7. If $s(i,j) > 4\kappa L/\tau$ then
$E[X_{ij,k}] > s(i,j)\tau/4\kappa > L$. So it suffices to enforce:

$s(i,j) > 4\kappa L/\tau$.   (4)

In order to have $O(mb)$ pairs produced by the pair sampling phase, we will choose
$\tau = 4\varphi/M$. The expected number of pair samples from $T_t$ is less than
$|T_t|^2 \tau f(\varphi, \varphi)$, using that $f$ is decreasing. For all measures we
consider, $f(\varphi, \varphi) < 1/\varphi$, so
$|T_t|^2 \tau f(\varphi, \varphi) < |T_t|^2/M < |T_t|$.

It remains to bound the probability that a pair of items, each with support at
least $\varphi$, is not sampled by SampleCount. Let the random variable
$X_{\cdot\cdot,k}$ represent the total number of samples taken in chunk $k$. The
probability that a pair $\{i,j\}$ is sampled in chunk $k$ is
$X_{ij,k}/X_{\cdot\cdot,k}$, so the probability that it does not get sampled in any
(even-numbered) chunk is $\prod_{k \in [\kappa]} (1 - X_{ij,k}/X_{\cdot\cdot,k})^{s}$.
We have seen before that $X_{ij,k} > s(i,j)\tau/4\kappa$. As for $X_{\cdot\cdot,k}$,
using a Chernoff bound we get $X_{\cdot\cdot,k} \approx E[X_{\cdot\cdot,k}] < mb/\kappa$,
using the linear upper bound on the number of samples. So we can compute:



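$\prod_{k \in [\kappa]} \left(1 - \frac{X_{ij,k}}{X_{\cdot\cdot,k}}\right)^{s} < \left(1 - \frac{s(i,j)\,\tau\,\kappa}{4\kappa\, mb}\right)^{s\kappa/2} < \exp\left(-\frac{s(i,j)\,\tau\, s\,\kappa}{8\, mb}\right)$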



In order for this probability to be small enough ($O(1/m^2)$), we need to bound
the similarity as

$s(i,j) > \frac{8\, mb\, L}{s\,\kappa\,\tau}$.   (5)
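
The arithmetic behind this threshold can be checked numerically. The sketch below
assumes that the non-sampling probability is bounded by
exp(-s(i,j) τ s κ / (8 mb)) and that "small enough" means at most exp(-L); the
function names and all parameter values are hypothetical and serve only to
exercise the formulas.

```python
import math


def miss_probability_bound(sim, tau, s, kappa, mb):
    """Assumed upper bound exp(-sim * tau * s * kappa / (8 * mb))."""
    return math.exp(-sim * tau * s * kappa / (8.0 * mb))


def similarity_threshold(L, tau, s, kappa, mb):
    """Similarity at which the bound above equals exp(-L), as in condition (5)."""
    return 8.0 * mb * L / (s * kappa * tau)


# Hypothetical parameter values.
mb, tau, s, kappa = 1_000_000, 2_000.0, 5_000, 100
L = 3 * math.log(1_000)

threshold = similarity_threshold(L, tau, s, kappa, mb)
bound = miss_probability_bound(threshold, tau, s, kappa, mb)

# The elementary step (1 - x)^n <= exp(-x * n), with x = sim * tau / (4 * mb)
# and n = s * kappa / 2, evaluated at the threshold similarity.
x = threshold * tau / (4.0 * mb)
exact = (1.0 - x) ** (s * kappa / 2)
print(threshold, exact, bound, math.exp(-L))
```

Any similarity above the threshold makes the bound strictly smaller than exp(-L).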

To choose the best value of $\kappa$ we balance constraints (4) and (5), getting:



From which we can deduce: 



kL vnbL I mbM 

(6) 




(7) 
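
The balancing step can be sketched as follows, assuming that constraint (4) reads
$s(i,j) > 4\kappa L/\tau$, that constraint (5) reads $s(i,j) > 8\,mb\,L/(s\kappa\tau)$,
and that $\tau = 4\varphi/M$ as chosen earlier. Here $s$ without arguments is the
exponent from the non-sampling probability and $mb$ is a single quantity. This is
a reconstruction under those assumptions, so its form and constants may differ
from the displayed equations (6) and (7).

```latex
\begin{align*}
\frac{4\kappa L}{\tau} = \frac{8\,mb\,L}{s\,\kappa\,\tau}
  \;&\Longrightarrow\; \kappa^{2} = \frac{2\,mb}{s}
  \;\Longrightarrow\; \kappa = \sqrt{\frac{2\,mb}{s}}, \\
s(i,j) \;>\; \frac{4\kappa L}{\tau}
  \;&=\; \frac{4L}{\tau}\sqrt{\frac{2\,mb}{s}}
  \;=\; \frac{L M}{\varphi}\sqrt{\frac{2\,mb}{s}} .
\end{align*}
```

Constraint (4) grows with $\kappa$ while constraint (5) shrinks with it, which is
why equating the two gives the least restrictive requirement on the similarity.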
Figure 3: Plots of the ratios $|S_i^1|/|S_i|$ and $|S_i^1 \cap S_j^1|/|S_i \cap S_j|$.

5 Dataset characteristics 

We have computed, for a selection of the datasets hosted on the FIMI web page¹, the
ratios between the number of occurrences of single items and pairs in the first
half of the transactions and the total number of occurrences of the same items or
pairs. The values of some of these ratios, the most representative ones, are
plotted in Figure 3. On the x-axis, items or pairs are spread evenly after being
sorted by their associated ratio; the y-axis shows the value of the ratio. We have
taken into account only items and pairs whose support is over 20 occurrences in
the whole dataset, in order to avoid the noise that could be generated by very
rare elements. As we can see, the numbers of occurrences and co-occurrences are
not far from what would be expected under a random permutation of the
transactions. The synthetic data set behaves exactly as we would expect under a
random permutation, with the ratio being very close to 1/2 for almost all
items/pairs.

This means that even for real data sets, where the order of transactions is not ran- 
dom, the sampling probabilities used in the pair sampling are reasonably close to the 
ones that would be obtained under the random permutation assumption. 
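
The ratios described in this section can be computed along the following lines.
This is a minimal sketch, not the code used for the plots: it assumes the dataset
fits in memory as a list of transactions (each an iterable of item identifiers),
and the function name and the `min_support` parameter are illustrative.

```python
from collections import Counter
from itertools import combinations


def first_half_ratios(transactions, min_support=20):
    """Return sorted first-half/total occurrence ratios for items and for pairs."""
    half = len(transactions) // 2
    total_items, first_items = Counter(), Counter()
    total_pairs, first_pairs = Counter(), Counter()
    for index, transaction in enumerate(transactions):
        items = sorted(set(transaction))
        pairs = list(combinations(items, 2))
        total_items.update(items)
        total_pairs.update(pairs)
        if index < half:
            first_items.update(items)
            first_pairs.update(pairs)
    # Keep only items and pairs with more than min_support occurrences overall.
    item_ratios = sorted(
        first_items[i] / count
        for i, count in total_items.items()
        if count > min_support
    )
    pair_ratios = sorted(
        first_pairs[p] / count
        for p, count in total_pairs.items()
        if count > min_support
    )
    return item_ratios, pair_ratios
```

Plotting each returned list against evenly spaced x-positions gives curves of the
kind described above; under a random permutation the values concentrate around 1/2.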

6 Conclusions 

We presented the first study of the problem of mining similar pairs from a stream
of transactions that relies on the similarity of items and not only on the
frequency of pairs. The structure of the problem is studied and exploited in
order to establish an impossibility result and to present a suitable algorithm
that is fast and space-efficient. A thorough experimental study of (carefully
engineered versions of) the presented algorithm remains to be carried out.

An interesting open question is to extend the lower bound presented in Section 2
to our case of study, in which the transactions are given in random order.

¹ http://fimi.cs.helsinki.fi/

References

[1] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining
association rules in large databases. In Proc. International Conference On Very
Large Data Bases (VLDB 1994), pages 487-499. Morgan Kaufmann Publishers, Inc.,
September 1994.

[2] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of
approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137-147, 1999.

[3] Andrea Campagna and Rasmus Pagh. Finding associations and computing sim- 
ilarity via biased pair sampling. In Proc. 9th IEEE International Conference on 
Data Mining (ICDM 2009). 

[4] Andrea Campagna and Rasmus Pagh. Finding associations and computing
similarity via biased pair sampling. Invited for publication in Knowledge and
Information Systems, 2010.

[5] Amit Chakrabarti, Graham Cormode, and Andrew McGregor. Robust lower 
bounds for communication and stream computation. In Cynthia Dwork, editor, 
STOC, pages 641-650. ACM, 2008. 

[6] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items
in data streams. Theor. Comput. Sci., 312(1):3-15, 2004.

[7] Edith Cohen, Mayur Datar, Shinji Fujiwara, Aristides Gionis, Piotr Indyk, Rajeev
Motwani, Jeffrey D. Ullman, and Cheng Yang. Finding interesting associations
without support pruning. IEEE Trans. Knowl. Data Eng., 13(1):64-78, 2001.

[8] Graham Cormode and S. Muthukrishnan. What's hot and what's not: tracking
most frequent items dynamically. ACM Trans. Database Syst., 30(1):249-278,
2005.

[9] Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. Frequency
estimation of internet packet streams with limited space. In Proc. 10th Annual
European Symposium on Algorithms (ESA 2002), pages 348-360, 2002.

[10] Devdatt Dubhashi and Desh Ranjan. Balls and bins: a study in negative depen- 
dence. Random Struct. Algorithms, 13(2):99-124, 1998. 

[11] Sudipto Guha and Andrew McGregor. Stream order and order statistics: Quantile 
estimation in random-order streams. SIAM Journal on Computing, 38(5):2044- 
2059. 

[12] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, 2nd 
edition. Morgan Kaufmann, 2006. 

[13] Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao. Mining frequent patterns
without candidate generation: A frequent-pattern tree approach. Data Min.
Knowl. Discov., 8(1):53-87, 2004.

[14] Nan Jiang and Le Gruenwald. Research issues in data stream association rule 
mining. SIGMOD Record, 35(1):14-19, 2006. 

[15] Richard M. Karp, Scott Shenker, and Christos H. Papadimitriou. A simple
algorithm for finding frequent elements in streams and bags. ACM Trans. Database
Syst., 28:51-55, 2003.

[16] Eyal Kushilevitz and Noam Nisan. Communication complexity. Cambridge
University Press, New York, 1997.

[17] Young-Koo Lee, Won-Young Kim, Y. Dora Cai, and Jiawei Han. Comine:
Efficient mining of correlated patterns. In Proc. IEEE International Conference on
Data Mining (ICDM 2003), pages 581-584. IEEE Computer Society, 2003.

[18] Jayadev Misra and David Gries. Finding repeated elements. Sci. Comput.
Program., 2(2):143-152, 1982.

[19] Edward Omiecinski. Alternative interest measures for mining associations in
databases. IEEE Trans. Knowl. Data Eng., 15(1):57-69, 2003.

[20] Jeffrey Scott Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw.,
11(1):37-57, 1985.

[21] Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Zhenjie Zhang, and Aoying Zhou. A
false negative approach to mining frequent itemsets from high speed transactional
data streams. Inf. Sci., 176(14):1986-2015, 2006.

[22] Yunyue Zhu and Dennis Shasha. Statstream: Statistical monitoring of thousands 
of data streams in real time, pages 358-369. Morgan Kaufmann, 2002. 


